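# Packages used throughout: tidyverse (readr, dplyr, tidyr, purrr, ggplot2) and plotly
library(tidyverse)
library(plotly)
wine_rating = read_csv("./wine_rating.csv")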
price_country=
wine_rating |>
group_by(country)|>
summarise(avg_by_country = mean(price)) |>
filter(!is.na(avg_by_country))
world_map_data <-
price_country |>
mutate(text = paste(country, "<br>Avg Price: $", round(avg_by_country, 2)))
fig_country_all <- plot_ly(
data = world_map_data,
type = "choropleth",
locations = ~country,
locationmode = "country names",
z = ~avg_by_country,
text = ~text,
colorscale = "Viridis"
)
fig_country_all <-
fig_country_all|>
layout(
geo = list(
showframe = FALSE,
projection = list(type = 'mercator')
),
title = "Average Wine Price by Country"
)
fig_country_all
This world map shows the average wine price for each country in the dataset, giving an overview of the global wine market. Each country is shaded by its average price: darker shades indicate lower averages and lighter shades indicate higher ones. The United Kingdom has the lightest shade and therefore the highest average price on the map, while Mexico has the darkest shade and the lowest average price.
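The extremes read off the map can also be checked directly from the summary table. The following is a minimal sketch on made-up data; `toy_price_country` and its values are illustrative stand-ins for `price_country`, not figures from the dataset.

```r
library(dplyr)

# Toy stand-in for price_country (values are invented for illustration)
toy_price_country <- data.frame(
  country = c("A", "B", "C", "D"),
  avg_by_country = c(12.5, 48.0, 30.2, 7.9)
)

# Country with the highest average price
most_expensive <- toy_price_country |> slice_max(avg_by_country, n = 1)

# Country with the lowest average price
least_expensive <- toy_price_country |> slice_min(avg_by_country, n = 1)

most_expensive$country
least_expensive$country
```

Running the same two calls on `price_country` would list the countries at either end of the color scale.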
price_country <-
price_country |>
arrange(desc(avg_by_country))
top_10_country =
head(price_country, 10) |>
mutate(rank = row_number())
country10_map_data <- top_10_country |>
mutate(text = paste("Rank: ", rank, "<br>Country: ", country, "<br>Avg Price: $", round(avg_by_country, 2)))
fig_country_10map <- plot_ly(
data = country10_map_data,
type = "choropleth",
locations = ~country,
locationmode = "country names",
z = ~avg_by_country,
text = ~text,
colorscale = "YlGnBu"
)
fig_country_10map <-
fig_country_10map |>
layout(
geo = list(
showframe = FALSE,
projection = list(type = 'mercator')
),
title = "Top 10 Countries Based on Average Wine Price"
)
fig_country_10map
This map shows the 10 countries with the highest average wine prices. Countries are ranked by average price, and darker shades indicate lower prices. Hovering over a country shows its rank, name, and average price in US dollars. Based on this map, the three most expensive countries are the UK, France, and the US.
price_region=
wine_rating |>
group_by(country,region)|>
summarise(avg_by_region = mean(price)) |>
filter(!is.na(avg_by_region))|>
arrange(desc(avg_by_region))
top_10_regions <- head(price_region, 10)
fig_region10 <- plot_ly(
data = top_10_regions,
type = "bar",
x = ~reorder(region, -avg_by_region),
y = ~avg_by_region,
color = ~country,
text = ~paste("Country: ", country, "<br>Avg Price: $", round(avg_by_region, 2))
)
fig_region10 <-
fig_region10 |>
layout(
title = "Top 10 Regions Based on Average Wine Price",
xaxis = list(title = "Region"),
yaxis = list(title = "Average Wine Price"),
showlegend = TRUE
)
fig_region10
This bar chart displays the 10 regions with the highest average wine prices. Each bar represents a region, sorted in descending order of average price, and its height gives the average wine price on the y-axis. Bars are colored by country, as identified in the legend. France dominates the list, with eight of the top 10 regions.
price_winery=
wine_rating |>
group_by(country,region, winery)|>
summarise(avg_by_winery = mean(price)) |>
filter(!is.na(avg_by_winery))|>
arrange(desc(avg_by_winery))
price_winery_dis <- wine_rating %>%
group_by(country, region, winery) %>%
mutate(avg_by_winery = mean(price)) %>%
filter(!is.na(avg_by_winery)) %>%
select(country, region, winery, price, avg_by_winery) %>%
arrange(desc(avg_by_winery))
top_10_wineries <- price_winery_dis %>%
group_by(avg_by_winery) %>%
nest() %>%
arrange(desc(avg_by_winery)) %>%
head(10) %>%
unnest(cols = data)
# Create a Plotly boxplot with color-coded markers by region
boxplot_plotly <- plot_ly(
data = top_10_wineries,
type = "box",
x = ~winery,
y = ~price,
color = ~region, # Color by region
colors = "Set3" # Choose a color scale (Set3 provides clear distinctions)
)
# Customize layout
boxplot_plotly <- boxplot_plotly %>%
layout(
title = "Price Distribution for Top 10 Wineries",
xaxis = list(title = "Winery"),
yaxis = list(title = "Price"),
showlegend = TRUE # Display legend for region colors
)
# Display the Plotly plot
boxplot_plotly
This boxplot shows the price distribution for the top 10 wineries by average price, with each box colored by region. Each box represents a winery: the median line marks the central price, the box bounds mark the interquartile range, and the whiskers show the spread of prices. Some wineries appear as a single short line, which likely reflects very few data points or identical prices within those wineries.
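One way to check why some boxes collapse into a line is to count how many wines, and how many distinct prices, each winery contributes. This is a minimal sketch on invented data; `toy_wines` and the derived column names are assumptions for illustration, and the same summary could be run on `top_10_wineries`.

```r
library(dplyr)

# Toy data: W2 has a single wine, W3 has two wines at the same price
toy_wines <- data.frame(
  winery = c("W1", "W1", "W1", "W2", "W3", "W3"),
  price  = c(100, 150, 200, 500, 80, 80)
)

# Per-winery sample size and price variation
winery_spread <- toy_wines |>
  group_by(winery) |>
  summarise(
    n_wines = n(),
    n_distinct_prices = n_distinct(price),
    price_range = max(price) - min(price),
    .groups = "drop"
  )

# Wineries whose box would degenerate to a single line
flat_boxes <- winery_spread |> filter(n_distinct_prices == 1)
flat_boxes
```

A winery with `n_wines` equal to 1 points to limited data, while a larger `n_wines` with a `price_range` of 0 points to homogeneous prices.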
price_us=
wine_rating |>
filter(country=="United States")|>
select(country,region, winery,price)
price_us_200=
price_us|>
filter(price<200)
# Create a histogram
histogram_plotly_us <- plot_ly(
data = price_us_200,
type = "histogram",
x = ~price,
nbinsx = 10, # Adjust the number of bins as needed
marker = list(color = "lightblue")
)
# Customize layout
histogram_plotly_us <- histogram_plotly_us |>
layout(
title = "Wine Price Distribution in the US (Filtered under 200)",
xaxis = list(title = "Price"),
yaxis = list(title = "Frequency")
)
# Display the histogram
histogram_plotly_us
This histogram shows the distribution of wine prices in the United States, restricted to wines priced under $200. The x-axis gives the price range and the y-axis gives the number of wines in each price bin. The histogram shows that the majority of these wines are concentrated in the price range of 0 to 19.99 dollars.
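The share of wines in each price bin can be computed without reading it off the chart. The sketch below assumes 20-dollar bins, which matches the 0 to 19.99 range quoted above; `toy_prices` is invented data standing in for `price_us_200$price`.

```r
# Invented prices under 200 dollars
toy_prices <- c(8, 12, 15, 19, 25, 45, 60, 120, 180)

# Left-closed 20-dollar bins: [0,20), [20,40), ..., [180,200)
bins <- cut(toy_prices, breaks = seq(0, 200, by = 20), right = FALSE)

# Proportion of wines falling in each bin
share <- prop.table(table(bins))
round(share, 2)

# Label of the most populated bin
top_bin <- names(which.max(share))
top_bin
```

If the first bin's share exceeds 0.5 on the real data, the statement that the majority of wines cost under 20 dollars holds literally; otherwise that bin is the most common one rather than a majority.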
ggplot_rp_tf = wine_rating |>
ggplot(aes(x = log(price), y = rating, color = year)) +
geom_point()
ggplotly(ggplot_rp_tf)
lm_rp = lm(rating ~ log(price) + year, data = wine_rating)
broom::tidy(lm_rp) |>
knitr::kable(digits = 3)
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 3.067 | 0.114 | 26.917 | 0.000 |
| log(price) | 0.272 | 0.002 | 114.773 | 0.000 |
| year1988 | -0.221 | 0.228 | -0.971 | 0.331 |
| year1989 | -0.407 | 0.180 | -2.258 | 0.024 |
| year1990 | -0.358 | 0.180 | -1.990 | 0.047 |
| year1991 | -0.507 | 0.228 | -2.227 | 0.026 |
| year1992 | -0.461 | 0.161 | -2.860 | 0.004 |
| year1993 | -0.364 | 0.180 | -2.024 | 0.043 |
| year1995 | -0.305 | 0.151 | -2.028 | 0.043 |
| year1996 | -0.209 | 0.139 | -1.498 | 0.134 |
| year1997 | -0.062 | 0.136 | -0.458 | 0.647 |
| year1998 | -0.028 | 0.136 | -0.207 | 0.836 |
| year1999 | -0.105 | 0.122 | -0.857 | 0.391 |
| year2000 | -0.062 | 0.122 | -0.504 | 0.614 |
| year2001 | -0.085 | 0.127 | -0.670 | 0.503 |
| year2002 | 0.023 | 0.128 | 0.183 | 0.855 |
| year2003 | -0.054 | 0.125 | -0.436 | 0.663 |
| year2004 | -0.069 | 0.119 | -0.580 | 0.562 |
| year2005 | -0.086 | 0.115 | -0.748 | 0.455 |
| year2006 | -0.035 | 0.116 | -0.299 | 0.765 |
| year2007 | -0.045 | 0.117 | -0.390 | 0.696 |
| year2008 | -0.065 | 0.115 | -0.565 | 0.572 |
| year2009 | -0.058 | 0.115 | -0.505 | 0.614 |
| year2010 | -0.076 | 0.115 | -0.663 | 0.507 |
| year2011 | -0.057 | 0.114 | -0.500 | 0.617 |
| year2012 | -0.026 | 0.114 | -0.229 | 0.819 |
| year2013 | -0.053 | 0.114 | -0.467 | 0.640 |
| year2014 | -0.054 | 0.114 | -0.473 | 0.636 |
| year2015 | -0.008 | 0.114 | -0.070 | 0.944 |
| year2016 | -0.004 | 0.114 | -0.038 | 0.970 |
| year2017 | -0.003 | 0.114 | -0.027 | 0.979 |
| year2018 | 0.021 | 0.114 | 0.185 | 0.853 |
| year2019 | 0.111 | 0.114 | 0.977 | 0.329 |
| year2020 | 0.293 | 0.180 | 1.629 | 0.103 |
| yearN.V. | -0.051 | 0.114 | -0.450 | 0.653 |
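Because price enters the model on the natural-log scale, the `log(price)` coefficient is easiest to read as the change in rating associated with a proportional change in price, holding year fixed. A short worked calculation using the estimate from the table above (the variable names are illustrative):

```r
# Estimate for log(price) taken from the regression table
beta_log_price <- 0.272

# Expected change in rating when price doubles: beta * log(2)
effect_of_doubling <- beta_log_price * log(2)

# Expected change in rating for a 10% price increase: beta * log(1.10)
effect_of_10pct <- beta_log_price * log(1.10)

round(effect_of_doubling, 3)  # 0.189
round(effect_of_10pct, 3)     # 0.026
```

Under this model, a doubling of price is associated with a rating roughly 0.19 points higher.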
resid_rp = wine_rating |>
modelr::add_residuals(lm_rp) |>
ggplot(aes(x = log(price), y = resid)) + geom_point()
ggplotly(resid_rp)
bootstrap_rp = wine_rating |>
modelr::bootstrap(n = 1000) |>
mutate(
models = map(strap, \(df) lm(rating ~ log(price) + year, data = df)),
results = map(models, broom::tidy)) |>
select(results) |>
unnest(results) |>
filter(term == "log(price)") |>
ggplot(aes(x = estimate)) + geom_density()
ggplotly(bootstrap_rp)
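The bootstrap density can be summarized numerically with a percentile interval. The sketch below is self-contained and makes two substitutions that are assumptions, not the analysis above: it uses base R resampling instead of `modelr::bootstrap`, and the built-in `mtcars` data (`mpg ~ log(hp)`) instead of the wine data.

```r
set.seed(1)  # make the resampling reproducible

# Number of bootstrap resamples (kept small for speed)
n_boot <- 500

# Refit the model on each resample and keep the slope on the log term
boot_slopes <- vapply(seq_len(n_boot), function(i) {
  idx <- sample(nrow(mtcars), replace = TRUE)     # resample rows with replacement
  fit <- lm(mpg ~ log(hp), data = mtcars[idx, ])  # refit on the resampled rows
  unname(coef(fit)["log(hp)"])
}, numeric(1))

# 95% percentile interval for the slope
ci <- quantile(boot_slopes, probs = c(0.025, 0.975))
ci
```

Applied to the `log(price)` estimates from the bootstrap above, the same `quantile()` call would give an interval that does not rely on the normal-theory standard error.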
# Summarise ratings by country, year, and category for wines with more than 2000 ratings
wine_summary <- wine_rating %>%
group_by(country, year, categories) %>%
filter(number_of_ratings > 2000) %>%
summarise(mean_rating = mean(rating),
sd_rating = sd(rating),
n = n()) %>%
ungroup()
#remove missing values and any duplicates
wine_rating =
wine_rating |>
na.omit() |>
distinct() |>
mutate(name = gsub("\\d", "", winery), year = as.numeric(year),
name = iconv(name, from = "", to = "UTF-8", sub = ""),
winery = iconv(winery, from = "", to = "UTF-8", sub = ""),
region = iconv(region, from = "", to = "UTF-8", sub = ""))
# Plotly scatter plot of rating against price, colored by wine category
plot_ly(wine_rating, x = ~price, y = ~rating, type = 'scatter', mode = 'markers', color = ~categories) %>%
layout(title = "Wine rating versus price by category",
xaxis = list(title = "Price"),
yaxis = list(title = "Rating"))
This scatter plot shows the relationship between wine price and rating across categories. Points are colored by category: red, rosé, sparkling, and white. Prices vary widely within each category, and there does not appear to be a strong overall correlation between price and rating. Some highly rated wines are available at moderate prices, while some expensive wines have lower ratings. For red wine, however, there is a visible trend: higher-priced red wines tend to have higher ratings.
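The visual impression of a stronger trend in one category can be checked with a per-category correlation. This is a minimal sketch on invented data; `toy_ratings` is a hypothetical stand-in for `wine_rating`, and the log of price is used only to match the regression earlier in this analysis.

```r
library(dplyr)

# Invented data: a clear price-rating trend for Red, none for White
toy_ratings <- data.frame(
  categories = rep(c("Red", "White"), each = 4),
  price      = rep(c(10, 20, 40, 80), times = 2),
  rating     = c(3.6, 3.8, 4.0, 4.2, 3.9, 3.7, 4.0, 3.8)
)

# Correlation between log(price) and rating within each category
cor_by_category <- toy_ratings |>
  group_by(categories) |>
  summarise(
    cor_log_price = cor(log(price), rating),
    n = n(),
    .groups = "drop"
  )

cor_by_category
```

On the real data, a noticeably larger `cor_log_price` for red wine than for the other categories would support the trend described above.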
# Highest-rated regions within countries
wine_rating_summary <- wine_rating %>%
group_by(country, region) %>%
filter(number_of_ratings > 2000) %>%
summarise(mean_rating = mean(rating),
sd_rating = sd(rating),
n = n()) %>%
ungroup() %>%
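arrange(region, desc(mean_rating)) %>%
# top_n() with no wt ranks by the last column (here n) and keeps ties, so more than 20 rows can be returned
top_n(20, wt = n)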
# Plotly bar plot
plot_ly(wine_rating_summary, x = ~reorder(region, -mean_rating), y = ~mean_rating, type = 'bar', color = ~country) %>%
layout(title = "Average wine rating by region (Within countries)",
xaxis = list(title = "Region"),
yaxis = list(title = "Mean Rating"))
This bar chart compares average wine ratings by region within various countries, using only wines that received more than 2,000 ratings. Regions are ordered by mean rating. The chart indicates that certain regions consistently produce higher-rated wines, but it also reflects the diversity within countries. For instance, wines from regions such as Napa Valley, Barolo, and Bordeaux may have higher average ratings than wines from other regions. Italy tops the list of countries by a wide margin: of the 27 regions shown, almost half (13) are in Italy.
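The number of regions that end up in the chart depends on how the top rows are selected. The sketch below, on invented data, shows two documented behaviors of dplyr's `top_n()`: when no `wt` column is given it ranks by the last column of the table, and it keeps ties, so it can return more rows than requested. `toy_regions` is hypothetical.

```r
library(dplyr)

# Invented summary table; n is the last column
toy_regions <- data.frame(
  region      = c("R1", "R2", "R3", "R4", "R5"),
  mean_rating = c(4.5, 4.1, 4.4, 4.0, 4.2),
  n           = c(1, 3, 3, 2, 3)
)

# No wt given: ranks by the last column (n) and keeps the three-way tie
by_last_col <- toy_regions |> top_n(2)

# Explicit ranking by mean rating
by_rating <- toy_regions |> slice_max(mean_rating, n = 2)

by_last_col$region
by_rating$region
```

Here `top_n(2)` returns three regions because three rows share the largest `n`, while `slice_max(mean_rating, n = 2)` returns the two regions with the highest mean rating.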
# Data Preparation
rating_analysis_data <- wine_rating |>
filter(region %in% c("Napa Valley", "Napa County", "California")) |>
filter(categories != "Rose") |>
group_by(region, categories) |>
summarise(mean_rating = mean(rating)) |>
spread(key = categories, value = mean_rating)
# Convert to matrix (required for heatmap)
rating_matrix <- as.matrix(rating_analysis_data[,-1])
rownames(rating_matrix) <- rating_analysis_data$region
# Plotly Heatmap
fig <- plot_ly(x = colnames(rating_matrix),
y = rownames(rating_matrix),
z = rating_matrix,
type = "heatmap",
colorscale = "Viridis") %>%
layout(title = 'Rating Analysis by Wine Category in Napa Valley, Napa County, and California',
xaxis = list(title = 'Wine Category'),
yaxis = list(title = 'Region'))
fig
Narrowing the focus to individual regions and the wines they produce, this heatmap shows the average rating of each wine category within Napa Valley, Napa County, and California as a whole. Color intensity reflects the mean rating. For most of us, the most familiar of these regions are Napa County and Napa Valley. From the graph, the red wines from Napa Valley had the highest mean rating of the three regions, just as the White category of wines from Napa County had the highest mean rating. For the white wine, the production from Napa Valley had the highest mean rating.
# Keep wines with more than 2000 ratings and attach region/category means
welcoming_valleys_data <- wine_rating |>
filter(number_of_ratings > 2000) |>
group_by(region, categories) |>
# mutate() keeps one row per wine, so bubble size reflects each wine's own rating count
mutate(mean_rating = mean(rating),
mean_price = mean(price)) |>
ungroup() |>
filter(mean_rating > 4.3, mean_price > 0) # Keep high-rated groups with a positive mean price
# Create a Plotly Bubble Chart
fig <- plot_ly(data = welcoming_valleys_data,
x = ~mean_price,
y = ~mean_rating,
size = ~number_of_ratings,
color = ~region,
colorscale = "Viridis",
type = 'scatter',
mode = 'markers',
marker = list(sizemode = 'diameter', sizeref = 0.7, opacity = 0.4)) %>%
layout(
title = 'Most Welcoming Wine Valleys by Wine Category',
xaxis = list(title = 'Mean Price', range = c(0, max(welcoming_valleys_data$mean_price))),
yaxis = list(title = 'Mean Rating')
)
# Print the figure
fig
This bubble plot shows the relationship between mean price and mean rating, computed by region and wine category, for wines that had more than 2000 ratings and a mean rating above 4.3. Regions with larger bubbles and higher positions on the y-axis (mean rating) could be interpreted as more “welcoming” because they combine high ratings with a large number of ratings, suggesting popular approval. The mean price on the x-axis adds another dimension, indicating whether that quality comes at a higher cost. The region represented by the large orange bubble (‘Bolgheri Superiore’) seems to be the most “welcoming,” as it has a high mean rating and a significant number of reviews. Mean prices vary widely across the plot, but a cluster of regions has mean ratings in the narrow range of approximately 4.3 to 4.6. Also, not all highly rated wines are expensive, and not all expensive wines are highly rated, indicating that price is not the sole determinant of quality as perceived by raters.
all_wine_cate = wine_rating |>
filter(categories != "Varieties") |>
group_by(categories) |>
summarise(number = n())
all_wine_cate
## # A tibble: 4 × 2
## categories number
## <chr> <int>
## 1 Red 8666
## 2 Rose 397
## 3 Sparkling 1007
## 4 White 3764
price_plot= wine_rating |>
plot_ly(y = ~price, x= ~categories, color = ~ categories, type = "box", colors = "viridis") |>
layout(yaxis = list(range = c(0, 1500),dtick = 50))
price_plot
rating_plot= wine_rating |>
plot_ly(y = ~rating, x= ~categories, color = ~ categories, type = "box", colors = "viridis") |>
layout(yaxis = list(range = c(0, 5), dtick = 0.2))
rating_plot